1 Executive Summary

  • The aim of this report is to find out what kind of correlations are between weather and alcohol sales in Sydney.
  • The main discoveries are …

where did we get the data and how about the datasets, what it is doing and what we did with it conclusions And then final conclusions max 400 words


2 Initial Data Analysis (IDA)

2.1 Source of Data

This report combines sales data from an off-premise liquor store and weather data from the Bureau of Meteorology. The liquor store is called The Drink Hive (http://www.thedrinkhive.com.au/). It is located in a small up-market shopping centre called Saporium in Rosebery, NSW Australia. The sales data is primary data source pulled straight from the POS software Vend (https://www.vendhq.com/). It quantitatively measures time of sale, number of items sold and price whilst also measuring categories such as staff-member and product type. This allows us to objective measurements assuming that all transactions have been processed (which we have been assured have). We acknowledge that this is a small sample and does not necessarily represent the trends of the Australian consumer. The limitation of this data is that it only reflects one store and the trends of customers to that store.

The second dataset is weather data obtained from from the Bureau of Meteorology (BOM) (http://www.bom.gov.au/climate/data/) website. We chose BOM as they provide the most accurate and detailed quantitative measure of temperature per day available in Australia. This secondary data is openly available on their website. They are the primary source of this data and we trust that this data is accurate. Possible issues we may have with this data include temperature variations within one day. Because transactions occur at different periods throughout a day, and there are often fluctuations in temperature throughout a day, it may alter the results.

2.1.1 Rosebery Demographic

Rosebery has a population of approximately 10,000 with a median age of 33 and 2,500 families. The median weekly income is $1,900 and weekly rent is $580. 34% of residents have a university or tertiary education which is twice that of NSW average and 67% work full-time. 58% of the population live in flats or apartments. The suburb has a diverse mix of cultures with Chinese and Greek ancestory heavily over-represented compared to the NSW Average. 63% of the population’s parents were both born overseas, just under double the NSW average. Eastern orthodox over-represented and Anglican under-represented.(ABS, 2016)

2.2 Domain Knowledge

This report looks at the effect of weather on alcohol sales in the hope of improving the likelihood of small business success. A business’s core role is to provide goods and services. Their success is dependent on their ability to sell products for revenues. It is therefore imperative that they 1. Stock the correct product 2. Market the correct product 3. Price their products effectively In the past, these decisions have been the intuition of a good business operator, however with the accessibility of data and analytics tools, we can now take a more scientific approach which can be replicated by any business.

There is a wealth of evidence which shows that weather influences consumer behaviour and understanding this leads to better marketing decisions (Murray et al, 2010). The key areas of consumer behaviour this report will address are what conditions weather effects consumer spending and what conditions effect consumer behaviour.

Weather is defined by Merriam as the state of the atmosphere with respect to heat or cold, wetness or dryness, calm or storm, clearness or cloudiness. Researchers indicate that weather alters the shoppers mood.

Research indicates that weather can effect consumer spending in three ways 1. Bad weather keeps people at home reducing foot traffic and sales 2. Weather can influence store traffic and sales volume, and 3. Weather can influence sales by affecting the customers internal state.

We will use this report to show small businesses how they can use their data to increase profitability by making informed decisions about stocking, marketing and pricing their products, helping them find the signal in the noise.

#This prevents strings from being factorised which keeps Date as Date
options(stringsAsFactors = FALSE)
# LOAD DATA
library(ggplot2)
library(tidyr)
library(xtable)
library(knitr)
## Warning: package 'knitr' was built under R version 3.3.3
data = read.csv("data/ProcessedData.csv")

head(data$Date)
## [1] "2018-08-06" "2018-08-06" "2018-08-06" "2018-08-06" "2018-08-06"
## [6] "2018-08-06"
#Convert date to date format
# convert date column to date type and format to Australian date standards
data$Date <- format(as.Date(data$Date), "%d-%m-%Y")

# view R class of data
class(data$DATE)
## [1] "NULL"
## [1] "Date"

# view results
head(data$DATE)
## NULL
## [1] "2013-08-21" "2013-08-26" "2013-08-27" "2013-09-01" "2013-09-09"
## [6] "2013-09-10"
# Quick look at top 6 rows of data
head(data)
##    X       Date Receipt.Number Total
## 1  0 06-08-2018          18179 27.00
## 2  4 06-08-2018          18178 24.00
## 3  7 06-08-2018          18177 47.00
## 4 11 06-08-2018          18176 27.00
## 5 16 06-08-2018          18175 31.05
## 6 21 06-08-2018          18174 33.99
##                                                                                    Details
## 1                                                                         1 X Cirillo Rose
## 2                                                     2 X Fever Tree Elderflower Tonic 4pk
## 3        2 X Ps40 Smoked Lemonade + 1 X Athletes of Wine Vino Athletico Macedon Pinot noir
## 4 1 X Empty Wine Bottle 750ml + 1 X Unico Zelo Harvest Sauvignon Blanc KEG + -1 X Discount
## 5          3 X Frenchies Kolsch 330ml + 3 X Frenchies Comet Pale Ale 330ml + -1 X Discount
## 6                                                1 X Domaine Thomson - Explorer Pinot Noir
##       Time Maximum.temperature..Degree.C. Rainfall.amount..millimetres.
## 1 18:42:40                             19                             0
## 2 18:03:54                             19                             0
## 3 17:45:58                             19                             0
## 4 17:32:56                             19                             0
## 5 16:26:31                             19                             0
## 6 15:05:09                             19                             0
## Size of data
dim(data)
## [1] 11870     8
## R's classification of data
class(data)
## [1] "data.frame"
## R's classification of variables
str(data)
## 'data.frame':    11870 obs. of  8 variables:
##  $ X                             : int  0 4 7 11 16 21 24 34 37 40 ...
##  $ Date                          : chr  "06-08-2018" "06-08-2018" "06-08-2018" "06-08-2018" ...
##  $ Receipt.Number                : chr  "18179" "18178" "18177" "18176" ...
##  $ Total                         : num  27 24 47 27 31.1 ...
##  $ Details                       : chr  "1 X Cirillo Rose" "2 X Fever Tree Elderflower Tonic 4pk" "2 X Ps40 Smoked Lemonade + 1 X Athletes of Wine Vino Athletico Macedon Pinot noir" "1 X Empty Wine Bottle 750ml + 1 X Unico Zelo Harvest Sauvignon Blanc KEG + -1 X Discount" ...
##  $ Time                          : chr  "18:42:40" "18:03:54" "17:45:58" "17:32:56" ...
##  $ Maximum.temperature..Degree.C.: num  19 19 19 19 19 19 19 19 19 19 ...
##  $ Rainfall.amount..millimetres. : num  0 0 0 0 0 0 0 0 0 0 ...
#sapply(mtcars, class)

2.3 Classification of Variables

The dataset contains 11870 sales where each row represents 1 sale. We are representing the population using a small sample described above. We have removed columns that are not relevant to our clients use. We are using a multivariant dataset that consists of 8. Of X, Date, Receipt.Number, Total, Details, Time, Maximum.temperature..Degree.C., Rainfall.amount..millimetres. For our research questions we will be looking at: “Date”, as an ordinal categorical measurement representing days in a week, months in a year and seasons in a year. The min is “Total”,

2.4 Stakeholders

2.4.1 Customer - Indirect External Stakeholder

The owner of The Drink Hive, Mal Higgs is a customer. He will receive this report and be able to act on its recommendations. Mal has over 40 years experience in retail liquor and therefore has a strong intuition and has made his own hypotheses/predictions as to the outcome. He came to us as he does not have a mathematical nor statistical background and his Point Of Sale (POS) software does not help with his marketing campaign. Mal is a useful resource for questions about retail and the customer however we must take care not to be bias and assess the data objectively. We must also present the report in a way that a non-technical person can understand.

2.4.2 Project Team - Direct Internal Stakeholder

The project tea is made up of Anthony, Science Major, Sindi, Ivan and Sam who are Computer Science majors. We have decided to break up the team into their strenghts and backgrounds whilst also cross-checking each other to ensure that every member of the team learns and we can challenge each others thinking. Each member of the project team will be working on the research questions presented as well as focusing on the below areas of the report.

Sam has a background in business and data analytics. He will focus on the Initial Data Analysis. He must be aware of the limitations of his understanding of data science to ensure the relevant variables are used. Anthony has a strong critical thinking mind and a background in scientific literature. He will focus on the evidence based conclusions, compilation and presentation of the report. He must be must be aware of his non-technical audience to present the findings in a way that can be understood by a small business owner. Sindri and Ivar have backgrounds in Computer Science. They will be combining and wrangling the datasets into a useable table and lead the data representations found in this report. They will be ensuring the numerical and graphical summaries are insightful and relevant. They must ensure the correct data points and graphical representations are used to portray the questions being researched.

We have a fantastic breadth of knowledge that will make this report a valuable and useful tool for Mal and other small business retail owners.

2.5 Summary

3 Research Questions


3.1 Research Question 1

How does the maximum temperature affect the consumer decision when purcahsing alcohol?

Let´s first see some general information about out heat variables.

#The temperature is a quantitative variable. We start by changing it to a qualitative one using ranges that cover 5 degrees Celcius
temp = data$Maximum.temperature..Degree.C.
data$tempGroups = cut(temp, c(10,15,20,25,30,35,40,45))

#Take a look at overall data before looking at the graphs
heatData = data %>% drop_na(Maximum.temperature..Degree.C.)

dataFrame <- data.frame(Rows = c(nrow(data)-nrow(heatData)),
                        Max = c(max(temp)),
                        Min = c(min(temp)),
                        Mean = c(mean(temp)),
                        Median = c(median(temp)))

kable(dataFrame, col.names = c("Missing rows", "Max heat", "Min heat", "Mean maximum heat", "Median of maximum heat"))
Missing rows Max heat Min heat Mean maximum heat Median of maximum heat
0 43.4 14.3 23.28904 23.2

Lets start at looking at how much a person spends in the store on average. The number we see inside each bar represents the number of transactions for that temperature category and the line in each bar represents the median of money spent. As we can see from the chart, people seem to spend more money in milder temperature (15 - 40 degrees) but in more extreme temperatures (10 - 15 and 40 - 45 degrees) people tend to buy less alcohol. We can also observe some sort of spike in the bar chart. People tend to spend more money when the temperature is between 35 and 40 degrees.

#Transaction sizes for each temperature range
meanPerPerson = aggregate(data$Total ~ data$tempGroups, data, mean)
medPerPerson = aggregate(data$Total ~ data$tempGroups, data, median)
transactions = merge(x = meanPerPerson, y = medPerPerson, by='data$tempGroups')
names(transactions) = c('Temperature', 'Mean_total', 'Median_total')

Fre <- as.data.frame(table(data$tempGroups))
colnames(Fre)[1] <- "tempGroups" 
Fre$lab <- as.character(Fre$Freq)

#Barplot for average money spent with median lines
ggplot(transactions, aes(Temperature, Mean_total), label = Fre$Freq) + geom_bar(stat="identity", position = "dodge", fill = "#FF6666") + ggtitle("Average money spent (in dollars) per purchase for different temperature") + ylab("Dollars") + theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + geom_errorbar(data=transactions, aes(Temperature, ymax = Median_total, ymin = Median_total), size=1, linetype = "solid", inherit.aes = F, width = 0.9) + geom_text(aes(label = Fre$Freq), position = position_dodge(width = 0.9), vjust = 1.5) + scale_x_discrete(labels = c('10 - 15','15 - 20','20 - 25', '25 - 30', '30 - 35', '35 - 40', '40 - 45'))

Lets look at what an average day looks like in total sales for each temperature range. We again see similar trend as above. That in extreme temperature (10 - 15 and 40 - 45 degrees) the store doesn´t sell as much alcohol as in milder temperatures (15 - 40 degrees). We also see the same spike as above when the temperature is between 35 - 40 degrees.

 #Total money spent for each temperature range
 totalPerDay = aggregate(data$Total ~ data$tempGroups, data, sum)
 nrOfDaysPerTemp = aggregate(data$Date ~ data$tempGroups, data, function(x) length(unique(x)))
 totals = merge(x = totalPerDay, y = nrOfDaysPerTemp, by='data$tempGroups')
 names(totals) = c('Temperature', 'Total', 'NrOfDays')
 totals['meanPerDay'] = round(totals$Total / totals$NrOfDays, 1)
 
 ggplot(totals, aes(Temperature, meanPerDay)) + geom_bar(stat="identity", position = "dodge") + geom_bar(stat="identity", position = "dodge", fill = "#56B4E9") + ggtitle("Average money spent (dollars) in one day for different temperature") + ylab("Dollars") + theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + scale_x_discrete(labels = c('10 - 15','15 - 20','20 - 25', '25 - 30', '30 - 35', '35 - 40', '40 - 45'))

3.1.1 Summary

Looking at the values of the median and mean purchase transactions it becomes evident that there is not much change in consumer behaviour over the temperature ranges 15-35 degrees. However, the more extreme temperature values have more of an effect. During the coldest times (10-15 degrees) there is a definite drop in the amount of money spent per purchase. During the very hottest periods 40-45 degrees there is also a massive drop in the amount spent on each transaction. However, it is worth noting that there were very few transactions during that time. Another very interesting spike in sales occurred at the 35-40 temperature range. This could be because people drink more alcohol, however, these would be regular temperatures during the summer break time when people are on vacation and are hence drinking more alcohol in general.

3.2 Research Question 2

How does rainfall affect the consumer decision when purchasing alcohol?

There are a few days for which we do not have rain data so we start by removing those rows. We then take a better look at the rain data Let’s start by taking a better look at the rainfall data

rainData = data %>% drop_na(Rainfall.amount..millimetres.)
rain = rainData$Rainfall.amount..millimetres.

dataFrame <- data.frame(Rows = c(nrow(data)-nrow(rainData)),
                        Max = c(max(rain)),
                        Min = c(min(rain)),
                        Mean = c(mean(rain)),
                        Median = c(median(rain)))

kable(dataFrame, caption = "Fuck yeah", col.names = c("Missing rows", "Max rainfall", "Min rainfall", "Mean rainfall", "Median of rainfall"))
Fuck yeah
Missing rows Max rainfall Min rainfall Mean rainfall Median of rainfall
74 69.4 0 1.665683 0
#We start by changing the rainfall from a quantitative variable to a qualitative one
rainData$rainGroups = cut(rain, c(0,0.1,15,70), include.lowest = TRUE)

#Transaction sizes for each temperature range
meanPerPerson = aggregate(rainData$Total ~ rainData$rainGroups, rainData, mean)
medPerPerson = aggregate(rainData$Total ~ rainData$rainGroups, data, median)
transactions = merge(x = meanPerPerson, y = medPerPerson, by='rainData$rainGroups')
names(transactions) = c('Rainfall', 'Mean_total', 'Median_total')

FreRain <- as.data.frame(table(rainData$rainGroups))
colnames(FreRain)[1] <- "rainGroups" 


#Barplot for average money spent with median lines
ggplot(transactions, aes(Rainfall, Mean_total)) + geom_bar(stat="identity", position = "dodge", fill = "#FF6666") + ggtitle("Average money spent (in dollars) per purchase for different rainfall") + ylab("Dollars") + xlab("Rainfall") + theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + geom_errorbar(data=transactions, aes(Rainfall, ymax = Median_total, ymin = Median_total), size=1, linetype = "solid", inherit.aes = F, width = 0.9) + geom_text(aes(label = FreRain$Freq), position = position_dodge(width = 0.9), vjust = 1.5) + scale_x_discrete(labels = c('No rain','small rain', 'heavy rain'))

 #Number of transactions for each rainfall range
 ggplot(rainData, aes(rainGroups)) + geom_bar() + scale_x_discrete(labels = c('No rain','small rain', 'heavy rain')) + ylab("Number of transactions") + xlab("Rainfall") + theme_bw() + ggtitle("Number of total transaction for different rainfall") + theme(plot.title = element_text(hjust = 0.5))

 FreRain
##   rainGroups Freq
## 1    [0,0.1] 8274
## 2   (0.1,15] 3200
## 3    (15,70]  322
 #Total money spent for each temperature range
 totalPerDay = aggregate(rainData$Total ~ rainData$rainGroups, rainData, sum)
 nrOfDaysPerRain = aggregate(rainData$Date ~ rainData$rainGroups, rainData, function(x) length(unique(x)))
 rainTotals = merge(x = totalPerDay, y = nrOfDaysPerRain, by='rainData$rainGroups')
 names(rainTotals) = c('Rainfall', 'Total', 'NrOfDays')
 rainTotals['meanPerDay'] = round(rainTotals$Total / rainTotals$NrOfDays, 1)
 
 ggplot(rainTotals, aes(Rainfall, meanPerDay)) + geom_bar(stat="identity", position = "dodge") + scale_x_discrete(labels = c('No rain','small rain', 'heavy rain')) + ylab("Dollars") + xlab("Rainfall") + theme_bw() + ggtitle("Average money spent (dollars) in one day for different temperature") + theme(plot.title = element_text(hjust = 0.5))

3.2.1 Summary

  • There many more transactions on days that are not raining can be attributed to the fact that there are many more sunny days in Sydney than rainy days
  • People buy a little bit more when it rains.
  • The average money spent per transaction only rises slightly when there is any rain.
  • This may be because the store is located in a shopping centre. Maybe this would be different for stores that aren’t in a shopping centre.

3.3 Research Question 3

How does the time of year affect the consumer decision when purchasing alcohol?

Other possible research questions: What time of day do people buy their alcohol?

Insert text and analysis.

hour = as.integer(substr(data$Time, 0, 2))
data$timeGroups = cut(hour, seq(8,20,1))
#table(data$timeGroups)

#Total money made per time gap
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
timeTemp = data %>%
    group_by(timeGroups, tempGroups) %>% 
    summarize(total = sum(Total) ) 

names(nrOfDaysPerTemp) = c('tempGroups', 'nrOfDays')
timeTemp = merge(x= timeTemp, y=nrOfDaysPerTemp , on='tempGroups')
timeTemp['scaledTotal'] = timeTemp$total / timeTemp$nrOfDays
ggplot(timeTemp, aes(x = timeGroups, y = scaledTotal)) + geom_point() + geom_line(aes(group=tempGroups))

ggplot(timeTemp, aes(x = timeGroups, y = scaledTotal)) + geom_point() + facet_wrap(~tempGroups)

#ggplot(test, aes(x = timeGroups, y = total)) + geom_point() + geom_line(aes(group=tempGroups))


#average money spent by customer per transaction, note
#testMean = data %>%
#    group_by(timeGroups, tempGroups) %>% 
#    summarize(meanTotal = mean(Total) ) 
#ggplot(testMean, aes(x = timeGroups, y = meanTotal)) + geom_point() + facet_wrap(~tempGroups)

3.3.1 Summary:

TO DO:

4 Conclusions

TODO:


5 References

Australian Bureau of Statistics, 2016, 2016 Census QuickStats, Rosebery. viewed 21/08/2018 http://quickstats.censusdata.abs.gov.au/census_services/getproduct/census/2016/quickstat/SSC13405?opendocument

Gaukler, G. M. (2010). Preventing avoidable stockouts: The impact of item-level RFID in retail. Journal of Business & Industrial Marketing, 25(8), 572-581. doi:10.1108/08858621011088301

Kyle B. Murray, Fabrizio Di Muro, Adam Finn, Peter Popkowski Leszczyc, 2010. The effect of weather on consumer spending. Journal of Retailing and Consumer Services (17) 512-520

Weather. (n.d) In Merriam-Websters Collegiate Dictionary. Retrieved from https://www.merriam-webster.com/dictionary/weather


6 Personal reflection on group work

  • The way I contributed was
  • What I learnt about group work was …